Business Analytics

Advanced Data Wrangling

Ayush Patel and Jayati Sharma

12 February, 2024

Pre-requisite

You already…

  • Know basics of data wrangling in R
  • Know basics of data visualization in R
  • Know univariate and multivariate linear regression

Before we begin

Please install and load the following packages

library(dplyr)
library(tidyverse)
library(openintro)
library(nycflights13)



Access the lecture slides from the course landing page

About me

I am Ayush.

I am a researcher working at the intersection of data, law, development and economics.

I teach Data Science using R at the Gokhale Institute of Politics and Economics.

I am an RStudio (Posit) certified tidyverse Instructor.

I am a Researcher at the Oxford Poverty and Human Development Initiative (OPHI) at the University of Oxford.

Reach me

ayush.ap58@gmail.com

ayush.patel@gipe.ac.in

Learning Objectives

  • Learn how to pivot and join data
  • Learn how to work with missing values
  • Learn how to use row wise functions

Tidy Data

Content for this topic has been sourced from Hadley Wickham’s ‘R for Data Science (2e)’. Please check out his work for detailed information.

  • Tidy data - a framework for consistent data structure
  • Take a look at table1, table2 and table3, all part of the tidyverse
  • All contain the same information but in different ways
table1
# A tibble: 6 × 4
  country      year  cases population
  <chr>       <dbl>  <dbl>      <dbl>
1 Afghanistan  1999    745   19987071
2 Afghanistan  2000   2666   20595360
3 Brazil       1999  37737  172006362
4 Brazil       2000  80488  174504898
5 China        1999 212258 1272915272
6 China        2000 213766 1280428583
table2
# A tibble: 12 × 4
   country      year type            count
   <chr>       <dbl> <chr>           <dbl>
 1 Afghanistan  1999 cases             745
 2 Afghanistan  1999 population   19987071
 3 Afghanistan  2000 cases            2666
 4 Afghanistan  2000 population   20595360
 5 Brazil       1999 cases           37737
 6 Brazil       1999 population  172006362
 7 Brazil       2000 cases           80488
 8 Brazil       2000 population  174504898
 9 China        1999 cases          212258
10 China        1999 population 1272915272
11 China        2000 cases          213766
12 China        2000 population 1280428583
table3
# A tibble: 6 × 3
  country      year rate             
  <chr>       <dbl> <chr>            
1 Afghanistan  1999 745/19987071     
2 Afghanistan  2000 2666/20595360    
3 Brazil       1999 37737/172006362  
4 Brazil       2000 80488/174504898  
5 China        1999 212258/1272915272
6 China        2000 213766/1280428583

Tidy Data - Rules

Content for this topic has been sourced from Hadley Wickham’s ‘R for Data Science (2e)’. Please check out his work for detailed information.

  • Advantages of tidy data structure - Consistent structure and uniformity
  • Variables in columns, observations in rows and values in cells

Source : R for Data Science 2e

First Things First - Let’s Recap

  • Using table1, calculate the total number of cases per year
  • How would you calculate the total number of cases per year with table2
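One possible solution to both questions, a sketch assuming table1 and table2 are available after loading the tidyverse:

```r
library(tidyverse)

# table1 is tidy: one row per country-year, so a grouped sum suffices
table1 %>%
  group_by(year) %>%
  summarise(total_cases = sum(cases))

# With table2, cases and population are stacked in a single column,
# so filter to the "cases" rows before summing
table2 %>%
  filter(type == "cases") %>%
  group_by(year) %>%
  summarise(total_cases = sum(count))
```

Both pipelines return the same totals; the extra filter() step is the cost of the untidy structure.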

Pivoting - Why?

  • Because data is often in an untidy structure
  • Sometimes you also need to transform data into another specific format required for your analysis
  • Pivot functions enable changing data from one format to another

Pivoting

Content for this topic has been sourced from Hadley Wickham’s ‘R for Data Science (2e)’. Please check out his work for detailed information.

  • When column names are not variable names but rather values
table4a
# A tibble: 3 × 3
  country     `1999` `2000`
  <chr>        <dbl>  <dbl>
1 Afghanistan    745   2666
2 Brazil       37737  80488
3 China       212258 213766
  • The column names 1999 and 2000 are actually values of the year variable
  • Values in these two variables represent the cases variable

Pivoting

Content for this topic has been sourced from Hadley Wickham’s ‘R for Data Science (2e)’. Please check out his work for detailed information.

  • So exactly why do we need to change the structure of the data?
  • Because here, one row represents 2 observations
  • We need to
    • Identify the columns whose names are values rather than variables (1999 and 2000)
    • Name the variable to move the column names into (year)
    • Name the variable to move the cell values into (cases)
  • All these steps are done together using pivot_longer()

Pivoting - pivot_longer()

Content for this topic has been sourced from Hadley Wickham’s ‘R for Data Science (2e)’. Please check out his work for detailed information.

table4a %>%
  pivot_longer(c(`1999`, `2000`), names_to = "year", values_to = "cases")
# A tibble: 6 × 3
  country     year   cases
  <chr>       <chr>  <dbl>
1 Afghanistan 1999     745
2 Afghanistan 2000    2666
3 Brazil      1999   37737
4 Brazil      2000   80488
5 China       1999  212258
6 China       2000  213766
  • Converts data into longer format

Pivoting - pivot_wider()

Content for this topic has been sourced from Hadley Wickham’s ‘R for Data Science (2e)’. Please check out his work for detailed information.

  • pivot_wider() is the opposite of pivot_longer()
  • Used when one observation is spread across multiple rows
table2
# A tibble: 12 × 4
   country      year type            count
   <chr>       <dbl> <chr>           <dbl>
 1 Afghanistan  1999 cases             745
 2 Afghanistan  1999 population   19987071
 3 Afghanistan  2000 cases            2666
 4 Afghanistan  2000 population   20595360
 5 Brazil       1999 cases           37737
 6 Brazil       1999 population  172006362
 7 Brazil       2000 cases           80488
 8 Brazil       2000 population  174504898
 9 China        1999 cases          212258
10 China        1999 population 1272915272
11 China        2000 cases          213766
12 China        2000 population 1280428583
table2 %>%
    pivot_wider(names_from = type, values_from = count)
# A tibble: 6 × 4
  country      year  cases population
  <chr>       <dbl>  <dbl>      <dbl>
1 Afghanistan  1999    745   19987071
2 Afghanistan  2000   2666   20595360
3 Brazil       1999  37737  172006362
4 Brazil       2000  80488  174504898
5 China        1999 212258 1272915272
6 China        2000 213766 1280428583

Do It Yourself - 1

Run the following code in your console

DIY1 <- data.frame(student_name = c("A","B","C","D","A","B","C","D"),
                   subject = c("Maths","Maths","Maths","Maths","English","English","English","English"),
                   grade = c(78,79,98,87, 77,65,69,80))
  • Is DIY1 already in tidy data structure? If not, how would you change it?
DIY1_1 <- data.frame(student_name = c("A","B","C","D"),
                   P_1 = c(67,87,93,56),
                   P_2 = c(86,57,68,94))
  • DIY1_1 shows the grades of students in class for Periodical 1 and Periodical 2. Is the data in tidy format? If not, how would you transform it?

Joins - Why?

  • More often than not, you work with multiple dataframes
  • Once you have all your individual dataframes in the required format, you might want to join these datasets together
  • Join functions help combine datasets by identifying matching observations

Joins - Keys

Content for this topic has been sourced from Hadley Wickham’s ‘R for Data Science (2e)’. Please check out his work for detailed information.

  • To use join functions, it is important to understand what keys are
  • Every join involves the following keys
    • primary key - a variable that uniquely identifies each observation in a dataset
    • when more than one variable is needed, the key is called a compound key
    • foreign key - a variable (or set of variables) that corresponds to a primary key in another table
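A quick way to verify a candidate primary key is to count rows per key value and look for duplicates; a sketch using nycflights13:

```r
library(dplyr)
library(nycflights13)

# If carrier is a primary key, no carrier code should appear more than once
airlines %>%
  count(carrier) %>%
  filter(n > 1)
# An empty result (0 rows) confirms carrier uniquely identifies each row
```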

Joins - Primary Key

Content for this topic has been sourced from Hadley Wickham’s ‘R for Data Science (2e)’. Please check out his work for detailed information.

nycflights13::airlines
# A tibble: 16 × 2
   carrier name                       
   <chr>   <chr>                      
 1 9E      Endeavor Air Inc.          
 2 AA      American Airlines Inc.     
 3 AS      Alaska Airlines Inc.       
 4 B6      JetBlue Airways            
 5 DL      Delta Air Lines Inc.       
 6 EV      ExpressJet Airlines Inc.   
 7 F9      Frontier Airlines Inc.     
 8 FL      AirTran Airways Corporation
 9 HA      Hawaiian Airlines Inc.     
10 MQ      Envoy Air                  
11 OO      SkyWest Airlines Inc.      
12 UA      United Air Lines Inc.      
13 US      US Airways Inc.            
14 VX      Virgin America             
15 WN      Southwest Airlines Co.     
16 YV      Mesa Airlines Inc.         
  • airlines has two variables - carrier code and name
  • You can identify an airline with its two letter carrier code, making carrier the primary key
nycflights13::weather
# A tibble: 26,115 × 15
   origin  year month   day  hour  temp  dewp humid wind_dir wind_speed
   <chr>  <int> <int> <int> <int> <dbl> <dbl> <dbl>    <dbl>      <dbl>
 1 EWR     2013     1     1     1  39.0  26.1  59.4      270      10.4 
 2 EWR     2013     1     1     2  39.0  27.0  61.6      250       8.06
 3 EWR     2013     1     1     3  39.0  28.0  64.4      240      11.5 
 4 EWR     2013     1     1     4  39.9  28.0  62.2      250      12.7 
 5 EWR     2013     1     1     5  39.0  28.0  64.4      260      12.7 
 6 EWR     2013     1     1     6  37.9  28.0  67.2      240      11.5 
 7 EWR     2013     1     1     7  39.0  28.0  64.4      240      15.0 
 8 EWR     2013     1     1     8  39.9  28.0  62.2      250      10.4 
 9 EWR     2013     1     1     9  39.9  28.0  62.2      260      15.0 
10 EWR     2013     1     1    10  41    28.0  59.6      260      13.8 
# ℹ 26,105 more rows
# ℹ 5 more variables: wind_gust <dbl>, precip <dbl>, pressure <dbl>,
#   visib <dbl>, time_hour <dttm>
  • weather records data about the weather at the origin airports
  • You can identify each observation by the combination of location and time
  • Hence, origin and time_hour become the compound primary key
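The same duplicate check extends to a compound key by counting over both variables at once:

```r
library(dplyr)
library(nycflights13)

# origin alone is not unique (many hourly records per airport),
# but the origin + time_hour pair should identify each observation
weather %>%
  count(origin, time_hour) %>%
  filter(n > 1)
# An empty result confirms the compound key
```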

Joins - Foreign Key

Content for this topic has been sourced from Hadley Wickham’s ‘R for Data Science (2e)’. Please check out his work for detailed information.

  • Foreign Key is a variable in a dataset that corresponds to the primary key of another dataset
nycflights13::flights
# A tibble: 336,776 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1      517            515         2      830            819
 2  2013     1     1      533            529         4      850            830
 3  2013     1     1      542            540         2      923            850
 4  2013     1     1      544            545        -1     1004           1022
 5  2013     1     1      554            600        -6      812            837
 6  2013     1     1      554            558        -4      740            728
 7  2013     1     1      555            600        -5      913            854
 8  2013     1     1      557            600        -3      709            723
 9  2013     1     1      557            600        -3      838            846
10  2013     1     1      558            600        -2      753            745
# ℹ 336,766 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>
nycflights13::planes
# A tibble: 3,322 × 9
   tailnum  year type              manufacturer model engines seats speed engine
   <chr>   <int> <chr>             <chr>        <chr>   <int> <int> <int> <chr> 
 1 N10156   2004 Fixed wing multi… EMBRAER      EMB-…       2    55    NA Turbo…
 2 N102UW   1998 Fixed wing multi… AIRBUS INDU… A320…       2   182    NA Turbo…
 3 N103US   1999 Fixed wing multi… AIRBUS INDU… A320…       2   182    NA Turbo…
 4 N104UW   1999 Fixed wing multi… AIRBUS INDU… A320…       2   182    NA Turbo…
 5 N10575   2002 Fixed wing multi… EMBRAER      EMB-…       2    55    NA Turbo…
 6 N105UW   1999 Fixed wing multi… AIRBUS INDU… A320…       2   182    NA Turbo…
 7 N107US   1999 Fixed wing multi… AIRBUS INDU… A320…       2   182    NA Turbo…
 8 N108UW   1999 Fixed wing multi… AIRBUS INDU… A320…       2   182    NA Turbo…
 9 N109UW   1999 Fixed wing multi… AIRBUS INDU… A320…       2   182    NA Turbo…
10 N110UW   1999 Fixed wing multi… AIRBUS INDU… A320…       2   182    NA Turbo…
# ℹ 3,312 more rows
  • flights$tailnum is a foreign key that corresponds to the primary key planes$tailnum

Do It Yourself - 2

  • What is the primary key for airports and planes data from nycflights13?
  • What is the foreign key in flights that corresponds to the primary key in airports?

Joins - Structures

Content for this topic has been sourced from Hadley Wickham’s ‘R for Data Science (2e)’. Please check out his work for detailed information.

Relationships and connections between all datasets in the nycflights13 package

Source : R for Data Science 2e

What are join functions?

Content for this topic has been sourced from Hadley Wickham’s ‘R for Data Science (2e)’. Please check out his work for detailed information.

  • They take datasets 1 and 2 and join them into a single dataframe
  • In most cases, dataset 1 determines which rows and columns appear in the output
  • Six join functions - left_join(), inner_join(), right_join(), full_join(), semi_join() and anti_join()
  • Mutating joins combine variables from both datasets to produce the resulting dataframe
  • Filtering joins filter the rows of one dataset against the other to give the resulting dataframe

Dataset

Run the following code chunks in your console

data1 <- data.frame(name = c("A","B","C","D","E","F","G", "H", "I", "J"),
                   ID = c(193,176,6,273,132,182, 1, 42, 67,20),
                   age = c(23,24,32,43,23,54, 23,20,40,53),
                   state = c("MH", "MP", "KA", "MP", "UP", "GJ","MH", "UP", "MH", "TN"))


data2 <- data.frame(id = c(193,176,6,273,132,182, 11, 13, 15),
                   edu_level = c(12,8,10,12,12,0, 2,5,10))

Left Join

  • Suppose you want to add the additional variable of educational level from data2 to the existing data1
  • left_join() keeps every row of the first dataset (the one you are joining to)
  • join_by() specifies the key to be used for joining
  • Used mainly for joining additional variables
  • The output contains only the rows of data1 and leaves out the unmatched rows of data2
  • Note: rows in data1 that have no match in data2 get NA values in edu_level
data1 %>%
  left_join(data2, by = join_by(ID == id))
   name  ID age state edu_level
1     A 193  23    MH        12
2     B 176  24    MP         8
3     C   6  32    KA        10
4     D 273  43    MP        12
5     E 132  23    UP        12
6     F 182  54    GJ         0
7     G   1  23    MH        NA
8     H  42  20    UP        NA
9     I  67  40    MH        NA
10    J  20  53    TN        NA

Right Join

  • Yes, you guessed it right… right_join() does the opposite
  • Suppose we want to join all rows of data2 to all the variables of data1
  • By using a right_join() you have all the rows from data2 joined to data1
  • Rows from data2 which are not in data1 have NA values in remaining columns
data1 %>%
  right_join(data2, by = join_by(ID == id))
  name  ID age state edu_level
1    A 193  23    MH        12
2    B 176  24    MP         8
3    C   6  32    KA        10
4    D 273  43    MP        12
5    E 132  23    UP        12
6    F 182  54    GJ         0
7 <NA>  11  NA  <NA>         2
8 <NA>  13  NA  <NA>         5
9 <NA>  15  NA  <NA>        10

Inner Join

  • But what if you only want to see the rows with complete information? That is, rows common to both datasets
  • inner_join() comes to the rescue
data1 %>%
  inner_join(data2, by = join_by(ID == id))
  name  ID age state edu_level
1    A 193  23    MH        12
2    B 176  24    MP         8
3    C   6  32    KA        10
4    D 273  43    MP        12
5    E 132  23    UP        12
6    F 182  54    GJ         0

Full Join

  • What if you want to keep all the rows from both datasets?
  • We have the full_join()
  • NA values appear wherever data is missing from data1 or data2
data1 %>%
  full_join(data2, by = join_by(ID == id))
   name  ID age state edu_level
1     A 193  23    MH        12
2     B 176  24    MP         8
3     C   6  32    KA        10
4     D 273  43    MP        12
5     E 132  23    UP        12
6     F 182  54    GJ         0
7     G   1  23    MH        NA
8     H  42  20    UP        NA
9     I  67  40    MH        NA
10    J  20  53    TN        NA
11 <NA>  11  NA  <NA>         2
12 <NA>  13  NA  <NA>         5
13 <NA>  15  NA  <NA>        10

Joins…and some more

  • Often we work with datasets that have many variables
  • While joining, you can make the resulting dataset smaller by using basic dplyr functions
data2 %>%
  right_join(data1 %>% select(name, age, ID), by = join_by(id == ID))
    id edu_level name age
1  193        12    A  23
2  176         8    B  24
3    6        10    C  32
4  273        12    D  43
5  132        12    E  23
6  182         0    F  54
7    1        NA    G  23
8   42        NA    H  20
9   67        NA    I  40
10  20        NA    J  53

Do It Yourself - 3

  • From nycflights13, how can you join flights and airlines?
  • Can flights and planes be joined? If yes, how?
  • Do a left join of flights with planes
  • Now do a right join of the same
  • What does an inner join of the two look like?
  • Perform a join using flights and airlines such that the resulting data has three variables - carrier, name and origin

Semi Join

  • Type of filtering join
  • Keeps rows only from data1 that have a match in data2
  • For better understanding, compare output from semi join and inner join
data1 %>%
  semi_join(data2 , by = join_by(ID == id))
  name  ID age state
1    A 193  23    MH
2    B 176  24    MP
3    C   6  32    KA
4    D 273  43    MP
5    E 132  23    UP
6    F 182  54    GJ
data1 %>%
  inner_join(data2 , by = join_by(ID == id))
  name  ID age state edu_level
1    A 193  23    MH        12
2    B 176  24    MP         8
3    C   6  32    KA        10
4    D 273  43    MP        12
5    E 132  23    UP        12
6    F 182  54    GJ         0

Anti Join

  • Suppose you want to see the rows that do NOT match the second dataset
  • anti_join() helps identify rows that do not have matching data
  • Can you guess what the output would look like when you anti join data1 with data2?
data1 %>%
  anti_join(data2 , by = join_by(ID == id))
  name ID age state
1    G  1  23    MH
2    H 42  20    UP
3    I 67  40    MH
4    J 20  53    TN

Do It Yourself - 4

  • Do a semi join of flights with airlines
  • How many observations in flights are NOT there in planes dataset? Check using an appropriate join function

Data structure ready…what next?

  • You have your data in tidy format
  • You joined all the necessary datasets to arrive at the final dataset
  • We will now learn more functions to perform on the final dataset for data cleaning and analysis

Missing Values

  • A very important part of data analysis: handling missing values
  • Often, data will not come with all values filled in
  • While handling missing values: context matters!
  • Choose a method for handling missing values only once you know the reason for the absence

Load the following dataset

hospital_visits <- data.frame(name = c("John", NA, NA, NA, "Dave", NA, NA, "Travis", NA, NA),
                              year = c(2016, 2017, 2018, 2019, 2016, 2017, 2018, 2016, 2017, 2018),
                              number_of_visits = c(8, 10, 12, NA, 10, 5, NA , 5, 7, NA))

Missing Values

Content for this topic has been sourced from Hadley Wickham’s ‘R for Data Science (2e)’. Please check out his work for detailed information.

  • Take a look at hospital_visits
  • The first column name has NA values
  • In this context, missing values indicate that the value in the previous row has been repeated
  • The missing values can be filled using tidyr::fill()
hospital_visits %>%
  fill(name)
     name year number_of_visits
1    John 2016                8
2    John 2017               10
3    John 2018               12
4    John 2019               NA
5    Dave 2016               10
6    Dave 2017                5
7    Dave 2018               NA
8  Travis 2016                5
9  Travis 2017                7
10 Travis 2018               NA
  • In other contexts, missing values represent some fixed and known value, most commonly 0
  • Look at the number_of_visits column, here NA means 0 visits
  • dplyr::coalesce() can be used to replace them
hospital_visits$number_of_visits <- coalesce(hospital_visits$number_of_visits, 0)
hospital_visits
     name year number_of_visits
1    John 2016                8
2    <NA> 2017               10
3    <NA> 2018               12
4    <NA> 2019                0
5    Dave 2016               10
6    <NA> 2017                5
7    <NA> 2018                0
8  Travis 2016                5
9    <NA> 2017                7
10   <NA> 2018                0
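The same replacement can be written as a pipe with mutate(); tidyr::replace_na() is an equivalent alternative to coalesce() here. A sketch, rebuilding the data so the chunk is self-contained:

```r
library(dplyr)
library(tidyr)

hospital_visits <- data.frame(
  name = c("John", NA, NA, NA, "Dave", NA, NA, "Travis", NA, NA),
  year = c(2016, 2017, 2018, 2019, 2016, 2017, 2018, 2016, 2017, 2018),
  number_of_visits = c(8, 10, 12, NA, 10, 5, NA, 5, 7, NA)
)

# Fill the repeated names downward, then replace missing visits with 0
hospital_visits %>%
  fill(name) %>%
  mutate(number_of_visits = replace_na(number_of_visits, 0))
```

Doing both steps in one pipe also avoids the base-R assignment used above.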

Dropping Missing Values

  • In other cases, NA values might be present because the data was simply not collected
  • In such cases, you might want to remove the NA values
  • Suppose there are some NA values in the year variable
hospital_visits_year <- data.frame(name = c("John", NA, NA, NA, "Dave", NA, NA, "Travis", NA, NA),
                              year = c(2016, 2017, NA, 2019, 2016, NA, 2018, NA, 2017, 2018),
                              number_of_visits = c(8, 10, 12, NA, 10, 5, NA , 5, 7, NA))
  • In hospital_visits_year, we do not know what the NA values in year mean and want to remove those observations entirely
  • drop_na() drops NA values from the year variable
hospital_visits_year %>%
  drop_na(year)
  name year number_of_visits
1 John 2016                8
2 <NA> 2017               10
3 <NA> 2019               NA
4 Dave 2016               10
5 <NA> 2018               NA
6 <NA> 2017                7
7 <NA> 2018               NA
  • Alternatively, you can also drop NA from the entire dataset using drop_na()
hospital_visits_year %>%
  drop_na()
  name year number_of_visits
1 John 2016                8
2 Dave 2016               10

Do It Yourself - 5

Run the following in your console

datasets::airquality
  • In the Ozone variable, replace all NA values with 0
  • Drop all observations from the dataset where Solar.R is NA

More Functions - slice()

  • Start from the dataset with no NA values, built using
slice_data <- hospital_visits %>%
  fill(name)

slice_data$number_of_visits <- coalesce(slice_data$number_of_visits,0)
  • Suppose you want to see, for each person, the year with the highest number of hospital visits
  • Doing this by hand would be tedious, since every individual has multiple entries and the year of maximum visits differs across individuals
  • Thankfully, we have the slice() function

More Functions - slice()

Content for this topic has been sourced from dplyr. Please check out the work for detailed information.

  • slice() allows indexing rows by their locations
  • slice(1) would show the observation at the first position i.e. first observation
slice_data %>%
  slice(1)
  name year number_of_visits
1 John 2016                8
  • Similarly, slice_max() gives the row with the maximum value (and slice_min() the minimum)
slice_data %>%
  slice_max(number_of_visits)
  name year number_of_visits
1 John 2018               12
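And slice_min() gives the row(s) with the smallest value - note that, by default, it keeps all rows tied for the minimum. A self-contained sketch that rebuilds slice_data:

```r
library(dplyr)
library(tidyr)

slice_data <- data.frame(
  name = c("John", NA, NA, NA, "Dave", NA, NA, "Travis", NA, NA),
  year = c(2016, 2017, 2018, 2019, 2016, 2017, 2018, 2016, 2017, 2018),
  number_of_visits = c(8, 10, 12, NA, 10, 5, NA, 5, 7, NA)
) %>%
  fill(name) %>%
  mutate(number_of_visits = coalesce(number_of_visits, 0))

# slice_min() keeps every row tied for the minimum (here, the three 0-visit years)
slice_data %>%
  slice_min(number_of_visits)
```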

More Functions - slice()

  • Let us now come back to our original question: the highest number of hospital visits for each individual
  • We need the maximum count AND we need it for every individual
  • Using group_by(), arrange() and then slice(1) returns the first row of each group
slice_data %>%
  group_by(name) %>%
  arrange(desc(number_of_visits)) %>%
  slice(1)
# A tibble: 3 × 3
# Groups:   name [3]
  name    year number_of_visits
  <chr>  <dbl>            <dbl>
1 Dave    2016               10
2 John    2018               12
3 Travis  2017                7
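A more direct alternative, assuming dplyr 1.1 or later, is slice_max() with the per-operation by argument; a self-contained sketch:

```r
library(dplyr)
library(tidyr)

slice_data <- data.frame(
  name = c("John", NA, NA, NA, "Dave", NA, NA, "Travis", NA, NA),
  year = c(2016, 2017, 2018, 2019, 2016, 2017, 2018, 2016, 2017, 2018),
  number_of_visits = c(8, 10, 12, NA, 10, 5, NA, 5, 7, NA)
) %>%
  fill(name) %>%
  mutate(number_of_visits = coalesce(number_of_visits, 0))

# One row per person: the year with the most visits
slice_data %>%
  slice_max(number_of_visits, n = 1, by = name)
```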

More Functions - rowwise()

  • So far, we have learnt functions that work on columns, like select, filter, mutate, slice etc.
  • Think of group_by: it performs functions based on groups
  • But what if you want to perform functions across the columns of each row?
  • rowwise() helps in aggregating and performing functions row by row

Run the following in your console

row_data <- data.frame(id = c(1,2,3,4,5,6),
                       income_source1 = c(10,15,12,14,20,18),
                       income_source2 = c(12,14,11,16,25,15),
                       income_source3 = c(20,25,22,24,30,28))

More Functions - rowwise()

Content for this topic has been sourced from dplyr. Please check out the work for detailed information.

  • You want to calculate the mean income that an individual earns from all 3 sources

  • That is, the mean of the three income columns within each row

  • Take a look at the difference between the two outputs

  • Using only mutate() computes the mean of the three variables across all rows

row_data %>%
  mutate(mean_income = mean(c(income_source1, income_source2, income_source3)))
  id income_source1 income_source2 income_source3 mean_income
1  1             10             12             20    18.38889
2  2             15             14             25    18.38889
3  3             12             11             22    18.38889
4  4             14             16             24    18.38889
5  5             20             25             30    18.38889
6  6             18             15             28    18.38889
  • Using rowwise() gives mean of three variables for each row
row_data %>%
  rowwise() %>%
  mutate(mean_income = mean(c(income_source1, income_source2, income_source3)))
# A tibble: 6 × 5
# Rowwise: 
     id income_source1 income_source2 income_source3 mean_income
  <dbl>          <dbl>          <dbl>          <dbl>       <dbl>
1     1             10             12             20        14  
2     2             15             14             25        18  
3     3             12             11             22        15  
4     4             14             16             24        18  
5     5             20             25             30        25  
6     6             18             15             28        20.3

More Functions - c_across()

Content for this topic has been sourced from dplyr. Please check out the work for detailed information.

  • Often, your dataset will have many variables
  • It is quite tedious to type in all the variables
  • c_across() to the rescue!
  • It uses tidy selection syntax to select many variables at once
row_data %>%
  rowwise() %>%
  mutate(total_income = sum(c_across(income_source1 : income_source3)))
# A tibble: 6 × 5
# Rowwise: 
     id income_source1 income_source2 income_source3 total_income
  <dbl>          <dbl>          <dbl>          <dbl>        <dbl>
1     1             10             12             20           42
2     2             15             14             25           54
3     3             12             11             22           45
4     4             14             16             24           54
5     5             20             25             30           75
6     6             18             15             28           61
row_data %>%
  rowwise() %>%
  mutate(total_income = rowSums(pick(where(is.numeric), -id)))
# A tibble: 6 × 5
# Rowwise: 
     id income_source1 income_source2 income_source3 total_income
  <dbl>          <dbl>          <dbl>          <dbl>        <dbl>
1     1             10             12             20           42
2     2             15             14             25           54
3     3             12             11             22           45
4     4             14             16             24           54
5     5             20             25             30           75
6     6             18             15             28           61
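One caveat: the output above is still a rowwise data frame, so any later verbs keep running one row at a time; call ungroup() once the per-row computation is done. A sketch:

```r
library(dplyr)

row_data <- data.frame(id = c(1, 2, 3, 4, 5, 6),
                       income_source1 = c(10, 15, 12, 14, 20, 18),
                       income_source2 = c(12, 14, 11, 16, 25, 15),
                       income_source3 = c(20, 25, 22, 24, 30, 28))

row_data %>%
  rowwise() %>%
  mutate(total_income = sum(c_across(income_source1:income_source3))) %>%
  ungroup()  # drop the rowwise grouping for subsequent column-wise work
```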

Do It Yourself - 6

  • Use yrbss data from openintro package
  • Slice the data in such a way to see the 157th observation
  • Slice the data in such a way to see the 10th to 15th observations
  • Slice the data by grade to see the first observation from each grade
  • Calculate the mean of physically_active_7d and strength_training_7d for each individual

Thank You :)